• no connection (zero)
• skip connection (identity)
• 3 × 3 dilated convolution with rate 2
• 5 × 5 dilated convolution with rate 2
• 3 × 3 max pooling
• 3 × 3 average pooling
• 3 × 3 depth-wise separable convolution
• 5 × 5 depth-wise separable convolution
We replace the depth-wise separable convolutions with binarized counterparts, i.e., with binarized weights and activations. The skip connection is an identity mapping in NAS, rather than an additional shortcut. Optimizing BNNs is more challenging than optimizing conventional CNNs [77, 199], since binarization adds an extra burden to NAS. Following [151], to reduce undesirable fluctuation in the performance evaluation, we normalize the architecture parameters of the M operations on each edge to obtain the final architecture indicator as
\[
\hat{o}_{m}^{(i,j)}(a^{(j)}) = \frac{\exp\{\alpha_{m}^{(i,j)}\}}{\sum_{m'} \exp\{\alpha_{m'}^{(i,j)}\}}\, o_{m}^{(i,j)}(a^{(j)}),
\tag{4.28}
\]
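Below is a minimal PyTorch-style sketch of such a softmax-normalized mixed operation on a single edge, in the spirit of differentiable NAS; the class and attribute names (`MixedOp`, `ops`, `alpha`) are illustrative and not taken from the DCP-NAS code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge (i, j): softmax-normalized combination of M candidate operations (Eq. 4.28)."""

    def __init__(self, candidate_ops):
        super().__init__()
        # candidate_ops: list of nn.Module, one per operation in the search space above
        self.ops = nn.ModuleList(candidate_ops)
        # architecture parameters alpha_m^{(i,j)}, one scalar per candidate operation
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(candidate_ops)))

    def forward(self, x):
        # normalize the M architecture parameters of this edge:
        # exp(alpha_m) / sum_{m'} exp(alpha_{m'})
        weights = F.softmax(self.alpha, dim=0)
        # weighted sum of the candidate operations o_m^{(i,j)}(a^{(j)})
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```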
4.4.4 Tangent Propagation for DCP-NAS
In this section, we first propose generating the tangent direction based on the Parent model and then present tangent propagation to search for the optimized architecture in binary NAS effectively. As shown in Fig. 4.12, the novelty of DCP-NAS lies in introducing tangent propagation and decoupled optimization, leading to a practical discrepancy-based search framework. The main motivation of DCP-NAS is to “fine-tune” the Child model architecture based on the real-valued Parent rather than directly binarizing the Parent. Thus, we first take advantage of the Parent model to generate the tangent direction from its architecture gradient as
\[
\frac{\partial \tilde{f}^{P}(w, \alpha)}{\partial \alpha} = \sum_{n=1}^{N} \frac{\partial p_{n}(w, \alpha)}{\partial \alpha},
\tag{4.29}
\]
where $\tilde{f}(w, \alpha)$ is defined in Eq. 4.21.
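As a rough sketch, assuming the Parent is a differentiable real-valued supernet whose architecture parameters are exposed as a single tensor `alpha` and whose forward pass returns the per-sample outputs $p_n(w, \alpha)$, the tangent direction can be obtained with one autograd call (the helper name `tangent_direction` is ours, not the official API):

```python
import torch

def tangent_direction(parent_model, alpha, inputs):
    """Tangent direction of Eq. 4.29: the gradient of sum_n p_n(w, alpha) w.r.t. alpha."""
    outputs = parent_model(inputs, alpha)      # p_n(w, alpha) for the N samples in the batch
    objective = outputs.sum()                  # sum_n p_n(w, alpha)
    # d(sum_n p_n)/d alpha = sum_n d p_n / d alpha
    (grad_alpha,) = torch.autograd.grad(objective, alpha)
    return grad_alpha.detach()
```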
Then we conduct the second step, i.e., tangent propagation for the Child model. In each epoch of binary NAS in our DCP-NAS, we inherit the architecture parameters from the real-valued model, i.e., $\hat{\alpha} \leftarrow \alpha$, and enforce the binary network to learn a distribution similar to that of the real-valued network:
\[
\max_{\hat{w}\in\mathcal{W},\,\hat{\alpha}\in\mathcal{A},\,\beta\in\mathbb{R}^{+}} G(\hat{w}, \hat{\alpha}, \beta)
= \tilde{f}^{P}(w, \alpha)\log\frac{\tilde{f}^{P}(w, \alpha)}{\tilde{f}^{C}_{b}(\hat{w}, \hat{\alpha}, \beta)}
= \sum_{n=1}^{N} p_{n}(w, \alpha)\log\frac{\hat{p}_{n}(\hat{w}, \hat{\alpha}, \beta)}{p_{n}(w, \alpha)},
\tag{4.30}
\]
where the KL divergence is used to supervise the binary search process. $G(\hat{w}, \hat{\alpha}, \beta)$ measures the similarity between the output logits of the real-valued network $p(\cdot)$ and those of the binary network $\hat{p}(\cdot)$, where the teacher's output is already given.
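A hedged PyTorch sketch of this objective is given below, assuming `parent_logits` and `child_logits` are the raw outputs of the real-valued Parent and the 1-bit Child on the same batch; the function name `child_parent_kl` is illustrative. It realizes the right-hand side of Eq. 4.30, $\sum_n p_n \log(\hat{p}_n / p_n)$, which is maximized (equivalently, its negation is minimized) during the binary search:

```python
import torch.nn.functional as F

def child_parent_kl(parent_logits, child_logits):
    """G of Eq. 4.30: sum_n p_n log(p_hat_n / p_n), averaged over the batch."""
    p = F.softmax(parent_logits.detach(), dim=-1)          # teacher distribution p_n (fixed)
    log_p = F.log_softmax(parent_logits.detach(), dim=-1)
    log_p_hat = F.log_softmax(child_logits, dim=-1)        # binary Child distribution p_hat_n
    # sum_n p_n * (log p_hat_n - log p_n) = -KL(p || p_hat); maximizing it aligns Child with Parent
    return (p * (log_p_hat - log_p)).sum(dim=-1).mean()
```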
To further optimize the binary architecture, we constrain the gradient of binary NAS
using the tangent direction as
\[
\min_{\hat{\alpha}\in\mathcal{A}} D(\hat{\alpha})
= \left\| \frac{\partial \tilde{f}^{P}(w, \alpha)}{\partial \alpha} - \frac{\partial G(\hat{w}, \hat{\alpha}, \beta)}{\partial \hat{\alpha}} \right\|_{2}^{2}.
\tag{4.31}
\]
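The sketch below combines the two previous helpers, assuming `tangent` is the Parent gradient from Eq. 4.29 (kept fixed) and `g_objective` is the scalar value of $G(\hat{w}, \hat{\alpha}, \beta)$ computed on the binary Child with architecture parameters `alpha_hat`; as before, the names are illustrative only.

```python
import torch

def tangent_constraint(g_objective, alpha_hat, tangent):
    """D(alpha_hat) of Eq. 4.31: squared L2 distance between the binary
    architecture gradient dG/d alpha_hat and the Parent tangent direction."""
    # create_graph=True keeps the graph so D(alpha_hat) can itself be minimized w.r.t. alpha_hat
    (grad_alpha_hat,) = torch.autograd.grad(g_objective, alpha_hat, create_graph=True)
    return (grad_alpha_hat - tangent.detach()).pow(2).sum()
```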